.Rmd}install.packages("put_package_name_here") into
the consoleinstall.packages("tidyverse")library("put_package_name_here") into your code
chunklibrary("tidyverse")# Load Relevant Packages
library(tidyverse) # piping `%>%`, plotting, reading data
library(skimr) # exploratory data summary
library(naniar) # exploratory plots
library(kableExtra) # tables
library(lubridate) # for date variables
library(plotly)
# Extension
# Try to save time by installing a package to install packages!
# install.packages("pacman")
# pacman::p_load(tidyverse, skimr, naniar, kableExtra, lubridate, plotly)
If the data is online:
This is easy for datasets that are not too large & already hosted online!
If the data is a local file:
If you are knitting your document: it will automatically look for data in the same folder as your R file. So you should have no dramas (per step 1.).
If you are running a section of code: you will need to
specify which folder the data is in. The best way to do this is by
following these menu options:
Session Menu >> Set Working Directory >> To Source File Location.
read_csv() function which comes
from the tidyverse package.# Load Data
data = read_csv("https://www.maths.usyd.edu.au/u/UG/OL/OLEO5605/r/NYC_Drinking_Water.csv")
#data = read_csv("Drinking_Water_Quality_Distribution_Monitoring_Data.csv")
# Glimpse Function [From tidyverse package]
data %>% glimpse()
## Rows: 72,709
## Columns: 10
## $ `Sample Date` <chr> "01/01/2015", "01/01/2015", "01/…
## $ `Sample Time` <time> 12:19:00, 11:15:00, 10:09:00, 1…
## $ `Sample Site` <chr> "1S07", "1S04", "1S03A", "1S03B"…
## $ `Sample class` <chr> "Operational", "Operational", "O…
## $ Location <chr> "SS - Shaft 7 of City Tunnel No.…
## $ `Residual Free Chlorine (mg/L)` <dbl> 0.58, 0.71, 0.79, 0.77, 0.74, 0.…
## $ `Turbidity (NTU)` <dbl> 0.96, 0.94, 0.93, 0.93, 0.95, 1.…
## $ `Fluoride (mg/L)` <dbl> 0.79, 0.80, 0.79, 0.80, NA, NA, …
## $ `Coliform (Quanti-Tray) (MPN /100mL)` <chr> "<1", "<1", "<1", "<1", "<1", "<…
## $ `E.coli(Quanti-Tray) (MPN/100mL)` <chr> "<1", "<1", "<1", "<1", "<1", "<…
# Skim Function [From skimr package]
data %>% skim()
| Name | Piped data |
| Number of rows | 72709 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| difftime | 1 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Sample Date | 0 | 1 | 10 | 10 | 0 | 1673 | 0 |
| Sample Site | 0 | 1 | 4 | 5 | 0 | 398 | 0 |
| Sample class | 0 | 1 | 9 | 20 | 0 | 7 | 0 |
| Location | 0 | 1 | 29 | 152 | 0 | 1263 | 0 |
| Coliform (Quanti-Tray) (MPN /100mL) | 0 | 1 | 1 | 6 | 0 | 47 | 0 |
| E.coli(Quanti-Tray) (MPN/100mL) | 0 | 1 | 1 | 2 | 0 | 3 | 0 |
Variable type: difftime
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| Sample Time | 2 | 1 | 17520 secs | 57780 secs | 10:10:00 | 463 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Residual Free Chlorine (mg/L) | 3 | 1.00 | 0.58 | 0.21 | -9.99 | 0.45 | 0.59 | 0.72 | 2.20 | ▁▁▁▁▇ |
| Turbidity (NTU) | 506 | 0.99 | 0.74 | 0.28 | 0.07 | 0.62 | 0.73 | 0.86 | 33.80 | ▇▁▁▁▁ |
| Fluoride (mg/L) | 63336 | 0.13 | 0.71 | 0.06 | 0.03 | 0.69 | 0.71 | 0.73 | 0.89 | ▁▁▁▇▇ |
# Summary Function [From base package -- preinstalled!]
data %>% summary()
## Sample Date Sample Time Sample Site Sample class
## Length:72709 Length:72709 Length:72709 Length:72709
## Class :character Class1:hms Class :character Class :character
## Mode :character Class2:difftime Mode :character Mode :character
## Mode :numeric
##
##
##
## Location Residual Free Chlorine (mg/L) Turbidity (NTU)
## Length:72709 Min. :-9.9900 Min. : 0.0700
## Class :character 1st Qu.: 0.4500 1st Qu.: 0.6200
## Mode :character Median : 0.5900 Median : 0.7300
## Mean : 0.5845 Mean : 0.7385
## 3rd Qu.: 0.7200 3rd Qu.: 0.8600
## Max. : 2.2000 Max. :33.8000
## NA's :3 NA's :506
## Fluoride (mg/L) Coliform (Quanti-Tray) (MPN /100mL)
## Min. :0.03 Length:72709
## 1st Qu.:0.69 Class :character
## Median :0.71 Mode :character
## Mean :0.71
## 3rd Qu.:0.73
## Max. :0.89
## NA's :63336
## E.coli(Quanti-Tray) (MPN/100mL)
## Length:72709
## Class :character
## Mode :character
##
##
##
##
# vis_miss function [From visdat or naniar packages]
vis_miss(data, warn_large_data = FALSE)
data %>%
select(where(is.numeric)) %>%
pivot_longer(everything(),
names_to = "Variable",
values_to = "Value") %>%
group_by(Variable) %>%
summarise(Mean = mean(Value, na.rm = TRUE),
Median = median(Value, na.rm = TRUE)) %>%
kable() %>% # putting into a table
kable_styling(bootstrap_options = c("hover")) # making table look good
| Variable | Mean | Median |
|---|---|---|
| Fluoride (mg/L) | 0.7144069 | 0.71 |
| Residual Free Chlorine (mg/L) | 0.5844564 | 0.59 |
| Turbidity (NTU) | 0.7385309 | 0.73 |
p = data %>%
select(where(is.numeric)) %>%
pivot_longer(everything(),
names_to = "Variable",
values_to = "Value") %>%
filter(!is.na(Value)) %>%
ggplot(aes(x = Value)) +
facet_wrap(~ Variable, scales = "free_y") +
geom_histogram(bins = 30, fill = "lightblue", color = "darkblue") +
labs(y = "Frequency") +
theme_bw()
ggplotly(p)
p = data %>%
select(where(is.numeric)) %>%
pivot_longer(everything(),
names_to = "Variable",
values_to = "Value") %>%
filter(!is.na(Value)) %>%
ggplot(aes(y = Value)) +
facet_wrap(~ Variable, scales = "free_y") +
geom_boxplot(fill = "lightblue", color = "darkblue") +
theme_bw() +
theme(strip.text.x = element_text(size = 8))
p
# See what state things are currently in
data %>% glimpse()
## Rows: 72,709
## Columns: 10
## $ `Sample Date` <chr> "01/01/2015", "01/01/2015", "01/…
## $ `Sample Time` <time> 12:19:00, 11:15:00, 10:09:00, 1…
## $ `Sample Site` <chr> "1S07", "1S04", "1S03A", "1S03B"…
## $ `Sample class` <chr> "Operational", "Operational", "O…
## $ Location <chr> "SS - Shaft 7 of City Tunnel No.…
## $ `Residual Free Chlorine (mg/L)` <dbl> 0.58, 0.71, 0.79, 0.77, 0.74, 0.…
## $ `Turbidity (NTU)` <dbl> 0.96, 0.94, 0.93, 0.93, 0.95, 1.…
## $ `Fluoride (mg/L)` <dbl> 0.79, 0.80, 0.79, 0.80, NA, NA, …
## $ `Coliform (Quanti-Tray) (MPN /100mL)` <chr> "<1", "<1", "<1", "<1", "<1", "<…
## $ `E.coli(Quanti-Tray) (MPN/100mL)` <chr> "<1", "<1", "<1", "<1", "<1", "<…
# Data Cleaning
## Changing type of `Sample Date`
# Note: mdy from [From tidyverse package]
data = data %>%
mutate(`Sample Date` = mdy(`Sample Date`))
# Adding additional time columns
data = data %>%
mutate(`Week of Year` = week(`Sample Date`),
`Weekday` = wday(`Sample Date`),
`Month Number` = month(`Sample Date`),
`Hour` = hour(`Sample Time`))
# Giving Month Name an Order
data = data %>%
mutate(`Month` = factor(month.name[`Month Number`],
levels = month.name))
# Converting Categorical Variables to Factors
data = data %>%
mutate(across(where(is_character) & !c(Location, `Sample Site`),
as_factor))
# Drop NA -- Q: Do you think this is appropriate?
# data = data %>%
# drop_na()
# Check we're happy with cleaned data
data %>% glimpse()
## Rows: 72,709
## Columns: 15
## $ `Sample Date` <date> 2015-01-01, 2015-01-01, 2015-01…
## $ `Sample Time` <time> 12:19:00, 11:15:00, 10:09:00, 1…
## $ `Sample Site` <chr> "1S07", "1S04", "1S03A", "1S03B"…
## $ `Sample class` <fct> Operational, Operational, Operat…
## $ Location <chr> "SS - Shaft 7 of City Tunnel No.…
## $ `Residual Free Chlorine (mg/L)` <dbl> 0.58, 0.71, 0.79, 0.77, 0.74, 0.…
## $ `Turbidity (NTU)` <dbl> 0.96, 0.94, 0.93, 0.93, 0.95, 1.…
## $ `Fluoride (mg/L)` <dbl> 0.79, 0.80, 0.79, 0.80, NA, NA, …
## $ `Coliform (Quanti-Tray) (MPN /100mL)` <fct> <1, <1, <1, <1, <1, <1, <1, <1, …
## $ `E.coli(Quanti-Tray) (MPN/100mL)` <fct> <1, <1, <1, <1, <1, <1, <1, <1, …
## $ `Week of Year` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Weekday <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ `Month Number` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Hour <int> 12, 11, 10, 10, 9, 8, 9, 11, 7, …
## $ Month <fct> January, January, January, Janua…
data %>% summary()
## Sample Date Sample Time Sample Site
## Min. :2015-01-01 Length:72709 Length:72709
## 1st Qu.:2016-04-03 Class1:hms Class :character
## Median :2017-06-26 Class2:difftime Mode :character
## Mean :2017-06-14 Mode :numeric
## 3rd Qu.:2018-09-11
## Max. :2019-10-31
##
## Sample class Location Residual Free Chlorine (mg/L)
## Operational :27733 Length:72709 Min. :-9.9900
## Compliance :44289 Class :character 1st Qu.: 0.4500
## Resample_Compliance : 472 Mode :character Median : 0.5900
## Resample_Operational: 1 Mean : 0.5845
## Entry Point : 168 3rd Qu.: 0.7200
## Re-Sample : 33 Max. : 2.2000
## Op-resample : 13 NA's :3
## Turbidity (NTU) Fluoride (mg/L) Coliform (Quanti-Tray) (MPN /100mL)
## Min. : 0.0700 Min. :0.03 <1 :72436
## 1st Qu.: 0.6200 1st Qu.:0.69 1 : 102
## Median : 0.7300 Median :0.71 >200.5 : 33
## Mean : 0.7385 Mean :0.71 2 : 28
## 3rd Qu.: 0.8600 3rd Qu.:0.73 3.1 : 20
## Max. :33.8000 Max. :0.89 4.2 : 9
## NA's :506 NA's :63336 (Other): 81
## E.coli(Quanti-Tray) (MPN/100mL) Week of Year Weekday
## <1:72704 Min. : 1.00 Min. :1.000
## - : 1 1st Qu.:14.00 1st Qu.:2.000
## 1 : 4 Median :26.00 Median :4.000
## Mean :26.15 Mean :4.019
## 3rd Qu.:38.00 3rd Qu.:6.000
## Max. :53.00 Max. :7.000
##
## Month Number Hour Month
## Min. : 1.000 Min. : 4.00 August : 6972
## 1st Qu.: 4.000 1st Qu.: 9.00 July : 6851
## Median : 6.000 Median :10.00 May : 6796
## Mean : 6.422 Mean : 9.68 June : 6630
## 3rd Qu.: 9.000 3rd Qu.:11.00 September: 6578
## Max. :12.000 Max. :16.00 January : 6521
## NA's :2 (Other) :32361
top_10_turbidity = data %>%
arrange(desc(`Turbidity (NTU)`)) %>%
select(`Sample Date`, `Turbidity (NTU)`) %>%
head(10)
top_10_turbidity %>%
kable(caption = "Top 10 Turbidity readings") %>%
kable_styling(bootstrap_options = c("hover"))
| Sample Date | Turbidity (NTU) |
|---|---|
| 2018-01-16 | 33.80 |
| 2017-05-19 | 27.10 |
| 2019-05-17 | 14.10 |
| 2017-12-15 | 6.97 |
| 2018-01-23 | 6.97 |
| 2017-11-27 | 6.68 |
| 2017-12-13 | 6.29 |
| 2016-05-16 | 6.04 |
| 2017-12-29 | 5.96 |
| 2018-02-21 | 5.89 |
The highest Turbidity rating was on 2018-01-16 with a reading of 33.8.
class_medians = data %>%
group_by(`Sample class`) %>%
summarise(med_chlorine = median(`Residual Free Chlorine (mg/L)`, na.rm = TRUE),
med_turbidity = median(`Turbidity (NTU)`, na.rm = TRUE),
med_flouride = median(`Fluoride (mg/L)`, na.rm = TRUE))
class_medians %>%
kable() %>%
kable_styling(bootstrap_options = c("hover"))
| Sample class | med_chlorine | med_turbidity | med_flouride |
|---|---|---|---|
| Operational | 0.69 | 0.75 | 0.71 |
| Compliance | 0.52 | 0.72 | 0.71 |
| Resample_Compliance | 0.47 | 0.70 | NA |
| Resample_Operational | 0.75 | 0.98 | NA |
| Entry Point | 0.63 | 0.72 | 0.72 |
| Re-Sample | 0.39 | 0.67 | NA |
| Op-resample | 0.73 | 0.70 | 0.72 |
p = data %>%
filter(`Sample class` == "Entry Point" |
`Sample class` == "Operational") %>%
ggplot(aes(x = `Sample class`,
y = `Residual Free Chlorine (mg/L)`)) +
geom_boxplot(fill = "lightblue", color = "darkblue") +
labs(title = "Residual Free Chlorine (mg/L) for different sample classes") +
theme_bw()
ggplotly(p)
site_medians_wide = data %>%
group_by(`Sample Site`) %>%
summarise(med_chlorine = median(`Residual Free Chlorine (mg/L)`, na.rm = TRUE),
med_turbidity = median(`Turbidity (NTU)`, na.rm = TRUE),
med_fluoride = median(`Fluoride (mg/L)`, na.rm = TRUE))
# Tidy Way
site_medians_long = site_medians_wide %>%
pivot_longer(!c(`Sample Site`),
names_to = "Median Type",
values_to = "Median Value")
max_min_median_sites = site_medians_long %>%
group_by(`Median Type`) %>%
summarise(
Min_Val = min(`Median Value`, na.rm = TRUE),
Min_Site = paste(`Sample Site`[which(`Median Value` == Min_Val)],
collapse = ", "),
Max_Val = max(`Median Value`, na.rm = TRUE),
Max_Site = paste(`Sample Site`[which(`Median Value` == Max_Val)],
collapse = ", ")
)
max_min_median_sites %>%
kable() %>%
kable_styling(bootstrap_options = c("hover"))
| Median Type | Min_Val | Min_Site | Max_Val | Max_Site |
|---|---|---|---|---|
| med_chlorine | 0.11 | 77750 | 0.91 | 1S03A |
| med_fluoride | 0.69 | 32350 | 0.81 | 1S04 |
| med_turbidity | 0.45 | 3SC26 | 1.03 | 51900 |
# Non-Tidy Way -- copy code for each Median Type
# site_medians_wide %>%
# select(`Sample Site`, med_turbidity) %>%
# arrange(desc(med_turbidity)) %>%
# filter(row_number() %in% c(1, n()))
#
# site_medians_wide %>%
# select(`Sample Site`, med_fluoride) %>%
# arrange(desc(med_fluoride)) %>%
# filter(row_number() %in% c(1, n()))
#
# site_medians_wide %>%
# select(`Sample Site`, med_chlorine) %>%
# arrange(desc(med_chlorine)) %>%
# filter(row_number() %in% c(1, n()))
site_names = max_min_median_sites %>%
filter(`Median Type` == "med_turbidity") %>%
select(`Min_Site`, `Max_Site`) %>%
t()
p = data %>%
filter(`Sample Site` %in% site_names) %>%
ggplot(aes(x = `Turbidity (NTU)`, fill = `Sample Site`)) +
geom_histogram(bins = 30, color = "white") +
scale_fill_manual(values = c("lightblue", "darkblue")) +
theme_bw() +
labs(y = "Frequency")
ggplotly(p)
p = data %>%
filter(`Sample Site` %in% site_names) %>%
ggplot(aes(x = `Sample Date`, y = `Turbidity (NTU)`, color = `Sample Site`))+
geom_line() +
scale_color_manual(values = c("lightblue", "darkblue")) +
theme_bw()
ggplotly(p)
p = data %>%
group_by(`Sample Date`) %>%
summarise(med_chlorine = median(`Residual Free Chlorine (mg/L)`, na.rm = TRUE),
med_turbidity = median(`Turbidity (NTU)`, na.rm = TRUE),
med_fluoride = median(`Fluoride (mg/L)`, na.rm = TRUE)) %>%
pivot_longer(!c(`Sample Date`),
names_to = "Median Type",
values_to = "Median Value") %>%
ggplot(aes(x = `Sample Date`, y = `Median Value`, colour = `Median Type`)) +
geom_line() +
scale_color_manual(values = c("#ab5f54", "lightblue", "darkblue")) +
theme_bw()
ggplotly(p)
p = data %>%
group_by(Month) %>%
ggplot(aes(x = Month, y = log(`Turbidity (NTU)`))) +
geom_boxplot(fill = "lightblue", color = "darkblue") +
theme_bw() +
theme(text = element_text(size = 10),
axis.text.x = element_text(angle = 45, vjust = -0.5)) +
labs(y = "Log of Turbidity (NTU)")
ggplotly(p)
The United States Environmental Protection Agency (EPA) publishes regulations for what makes water safe to drink. Here are the regulations for some of the features contained in the NYC drinking water quality dataset:
My research question is whether all sample sites and sample types adhere to these requirements for safe drinking water. For the sake of data completeness, I will focus only on the observations which have non-missing data for all variables:
data_complete <- data %>%
tidyr::drop_na()
Since three of these five metrics require month-level aggregation for
analysis, I will analyse the data at the level of each month. First, I
will rename column names to have periods instead of spaces using
make.names:
data_complete <- data_complete %>%
# Replace spaces with periods in column names
rename_all(make.names)
It was also observed that one data point has “-” as the value for Coliform and E.Coli, indicating this data point is missing. Coincidentally, this observation is also missing fluoride levels. This observation will be dropped for the sake of data completeness:
data_agg <- data_complete %>%
# Extract year from date
mutate(Year = year(Sample.Date)) %>%
# Create binary present/absent variables for Coliform and E.Coli
mutate(Coliform_Positive = Coliform..Quanti.Tray...MPN..100mL. != "<1",
E_Coli_Positive = E.coli.Quanti.Tray...MPN.100mL. != "<1") %>%
group_by(Month.Number, Year, Sample.Site, Sample.class) %>%
summarise(num_obs = n(),
Chlorine_Good = all(Residual.Free.Chlorine..mg.L. < 4),
Fluoride_Good = all(Fluoride..mg.L. < 4),
Turbidity_Good = all(Turbidity..NTU. < 1) & (sum(Turbidity..NTU. <= 0.3) / num_obs) >= 0.95,
Coliform_Good = sum(Coliform_Positive)/n() < 0.05,
E_Coli_Good = sum(E_Coli_Positive)/n() < 0.05)
First, I will visualize the distribution of each of the five testing conditions across all observations by month:
library(cowplot)
theme_set(theme_cowplot())
data_agg %>%
pivot_longer(cols = c(Chlorine_Good:E_Coli_Good),
names_to = "Condition",
values_to = "Result") %>%
mutate(Condition = str_replace_all(Condition, "_Good", "")) %>%
ggplot(data=., mapping=aes(x=Condition)) +
geom_bar(aes(fill=Result)) +
coord_flip() +
xlab("Condition Tested") +
ylab("Observation by Month")
As the stacked bar plot shows, most observations adhere to guidelines
for fluoride, E.coli, coliform, and chlorine; however, only a minority
of monthly observations adhere to turbidity guidelines.
This minority of samples meeting all requirements can be examined further:
library(DT)
data_agg %>%
filter(Chlorine_Good,
Fluoride_Good,
Turbidity_Good,
Coliform_Good,
E_Coli_Good) %>%
arrange(Sample.Site, Year, Month.Number) %>%
DT::datatable()
There are a total of 79 observations by month that fulfill all of the requirements for chlorine, fluoride, turbidity, coliform, and E.coli. We can break this down by site to see if these are all different site (i.e. 79 different sites/observation types), or if some sites consistently meet these requirements across different months:
data_agg %>%
group_by(Sample.Site, Sample.class) %>%
mutate(Number_of_Time_Points = n()) %>%
filter(Chlorine_Good,
Fluoride_Good,
Turbidity_Good,
Coliform_Good,
E_Coli_Good) %>%
dplyr::summarise(Number_of_Compliant_Time_Points = n(),
Percent_Compliant_Time_Points = 100*round(n()/Number_of_Time_Points, 4)) %>%
distinct() %>%
arrange(desc(Number_of_Compliant_Time_Points)) %>%
kable(.) %>%
kable_styling(full_width = F)
| Sample.Site | Sample.class | Number_of_Compliant_Time_Points | Percent_Compliant_Time_Points |
|---|---|---|---|
| 3SC26 | Operational | 20 | 45.45 |
| 1SCL1 | Operational | 11 | 22.00 |
| 33450 | Compliance | 11 | 57.89 |
| 33950 | Compliance | 10 | 52.63 |
| 11750 | Compliance | 8 | 44.44 |
| 31550 | Compliance | 4 | 22.22 |
| 18650 | Compliance | 3 | 15.79 |
| 30150 | Compliance | 3 | 16.67 |
| 32350 | Compliance | 3 | 15.79 |
| 1SCH3 | Operational | 2 | 5.00 |
| 1SCH3 | Entry Point | 1 | 100.00 |
| 1SCL1 | Entry Point | 1 | 100.00 |
| 1SCL1 | Op-resample | 1 | 50.00 |
| 31750 | Compliance | 1 | 5.26 |
As the table shows, there are a total of 14 sample site and sample class combinations that met the EPA requirements for safe drinking water at one or more time points. The top two entries are both operational samples from two different sites (3SC26 and 1SCL1). The latter of the two (site 1SCL1) also met all requirements for entry point and op-resample classes as well.
In this document, NYC drinking water quality data was explored using
data wrangling and visualization techniques. Data completeness was
assessed by tabulating NA values with summary functions and
using naniar::vis_miss(), which highlighted that data in
the Fluoride (mg/L) column was missing in 87.11% (63,336)
of observations. Given this fact, it would likely not be appropriate to
unilaterally drop any observations with NA values using
tidyr::drop_na(), as that would wipe out 87.11% of the data
right off the bat based on Fluoride (mg/L) alone. Since the
majority of data in this column is missing, imputation is likely not a
tractable option either; for downstream analysis, if this feature is
necessary to include and ~9,000 observations are sufficient, then
tidyr::drop_na() could be used. However, if fluoride levels
are not necessary to include, then this feature can be dropped; the only
other feature with missing data is Turbidity (NTU), in
which 506 observations (0.7%) are missing. In this case, imputation
techniques could be used to fill in those missing data and render the
dataset complete.
As the boxplots indicate, there were 100+ outliers in each of the evaluated variables (fluoride content, residual free chlorine content, and turbidity), with each feature exhibiting visibly skewed distribution. Fluoride was negative skewed, as evidenced by more outliers with values lower than the interquartile range (IQR). While residual free chlorine had one outlier with a large-magnitude negative observation, it is likely that this value is due to measurement error as it is impossible to have a negative quantity of chlorine; barring that one questionable observation, this variable exhibited a positive skew, given most outliers were greater than the IQR. For turbidity, it is difficult to be certain whether there are truly more outliers above the IQR or if there is overplotting for values below the IQR; at face value, it appears that turbidity also has a positive skew.
For data cleaning, several new features were generated based on the
Sample Date column to denote hour of the day, day of the
week, week number out of the calendar year, and month number out of the
calendar year. Categorical variables were additionally encoded as
factors. The data were then aggregated to summarise the dates with the
top 10 turbidity ratings; to compare median readings for fluoride,
chloride, and turbidity across sample sites; to compare the distribution
of chlorine levels across sample sites; to calculate which sample sites
have the highest versus lowest levels of fluoride, chloride, and
turbidity; and to visualize temporal trends in the median chloride,
fluoride, and turbidity across all sample sites.
In the research question component, it was investigated whether
sample sites and observation types met United States EPA requirements
for safe drinking water. EPA
guidelines for chlorine, fluoride, turbidity, Coliform, and E.coli
were independently sourced and are described above. In order to directly
compare sample types and classes for these features, only complete
observations were included (i.e. any rows with NA values
were dropped). Data were aggregated at the month level to work with
month-based requirements (e.g. <5% of samples with detected E.coli
per month) and the number of compliant sample types and samples were
visualized as a stacked bar chart using ggplot2::ggplot().
The month-aggregated data were then filtered to only those observations
for which all conditions were met, with the results displayed using an
interactive HTML table with DT::DT(). These data were
further aggregated to summarise the number of compliant months per
observation site and sample type, with the results displayed using a
static HTML table with knitr::kable().